Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection

Authors

  • Nongnuch Ketui
  • Thanaruk Theeramunkong
Abstract

Summarizing multiple Thai documents is challenging because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and then presents our three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted using fifty sets of Thai news articles with their manually constructed reference summaries. Under three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) the highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
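
The abstract names the stages but not their formulas, so the following is only a minimal sketch of a graph-based pipeline of this general shape, under stated assumptions. It assumes the units are already segmented and tokenized (TEDU segmentation is not implemented here), uses raw term counts for cosine similarity, and substitutes a LexRank-style damped power iteration for the paper's iterative weighting, whose exact definition the abstract does not give. The function names, the `damping` value, and the `redundancy_threshold` are hypothetical, and post-selection weight recalculation is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_unit_graph(units):
    """Return a symmetric similarity matrix; units are lists of tokens."""
    vectors = [Counter(u) for u in units]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim


def iterative_weights(sim, damping=0.85, iterations=50):
    """Assumed stand-in: LexRank-style damped power iteration over the graph."""
    n = len(sim)
    weights = [1.0 / n] * n
    row_sums = [sum(row) for row in sim]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its weight in proportion to sim[j][i].
            incoming = sum(
                weights[j] * sim[j][i] / row_sums[j]
                for j in range(n)
                if row_sums[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        weights = updated
    return weights


def select_units(sim, weights, k, redundancy_threshold=0.5):
    """Pick up to k units by descending weight, skipping near-duplicates."""
    selected = []
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in ranked:
        if len(selected) == k:
            break
        if all(sim[i][j] < redundancy_threshold for j in selected):
            selected.append(i)
    return selected
```

Calling `build_unit_graph`, then `iterative_weights`, then `select_units` on a list of tokenized units yields the indices of the units chosen for the summary. The threshold-based redundancy check is one simple design choice among several and is not taken from the paper.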

Similar resources

Thai News Text Summarization and Its Application

Since Thai language lacks word/phrase/sentence boundaries, document summarization in Thai needs investigations in unit segmentation, unit selection, redundancy removal and evaluation dataset construction. In this work, we have proposed Thai Elementary Discourse Unit (TEDU) and a three-stage method of Thai multidocument summarization, i.e., unit segmentation, unit-graph formulation, and unit sel...

Timestamped Graphs: Evolutionary Models of Text for Multi-Document Summarization

Current graph-based approaches to automatic text summarization, such as LexRank and TextRank, assume a static graph which does not model how the input texts emerge. A suitable evolutionary text graph model may impart a better understanding of the texts and improve the summarization process. We propose a timestamped graph (TSG) model that is motivated by human writing and reading processes, and ...

A new graph based text segmentation using Wikipedia for automatic text summarization

The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of...

Towards Coherent Multi-Document Summarization

This paper presents G-FLOW, a novel system for coherent extractive multi-document summarization (MDS). Where previous work on MDS considered sentence selection and ordering separately, G-FLOW introduces a joint model for selection and ordering that balances coherence and salience. G-FLOW’s core representation is a graph that approximates the discourse relations across sentences based on indica...

ON THE REFINEMENT OF THE UNIT AND UNITARY CAYLEY GRAPHS OF RINGS

Let $R$ be a ring (not necessarily commutative) with nonzero identity. We define $\Gamma(R)$ to be the graph with vertex set $R$ in which two distinct vertices $x$ and $y$ are adjacent if and only if there exist unit elements $u,v$ of $R$ such that $x+uyv$ is a unit of $R$. In this paper, basic properties of $\Gamma(R)$ are studied. We investigate connectivity and the girth of $\Gamma(R)$, where $...

Journal:
  • Computing and Informatics

Volume 35

Publication date: 2016